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ABSTRACT 

Non-coding (nc)RNAs are important structural and 
regulatory molecules. Accurate determination of the 
primary sequence and secondary structure of ncR- 
NAs is important for understanding their functions. 
During cDNA synthesis, RNA 3' end stem-loops 
can self-prime reverse transcription, creating RNA- 
cDNA chimeras. We found that chimeric RNA-cDNA 
fragments can also be detected at 5' end stem- 
loops, although at much lower frequency. Using the 
Gubler-Hoffman method, both types of chimeric frag- 
ments can be converted to cDNA during library con- 
struction, and they are readily detectable in high- 
throughput RNA sequencing (RNA-seq) experiments. 
Here, we show that these chimeric reads contain 
valuable information about the boundaries of ncR- 
NAs. We developed a bioinformatic method, called 
Vicinal, to precisely map the ends of numerous fruit- 
fly, mouse and human ncRNAs. Using this method, 
we analyzed chimeric reads from over 100 RNA-seq 
datasets, the results of which we make available for 
users to find RNAs of interest. In summary, we show 
that Vicinal is a useful tool for determination of the 
precise boundaries of uncharacterized ncRNAs, fa- 
cilitating further structure/function studies. 

INTRODUCTION 

Non-coding RNAs (ncRNAs) are functional RNA 
molecules that are not translated into proteins. Many cate- 
gories of ncRNAs have been discovered and characterized. 
These include RNAs that carry out basic cellular functions 
such as pre-mRNA splicing (small nuclear RNAs, snRN As) 
and mRNA translation (tRNAs and rRNAs) (1). Also 
included are the small nucleolar (sno)RNAs and small 
Cajal body (sca)RNAs that guide post-transcriptional 
modification of rRNAs and snRNAs, respectively (1). Not 
only are ncRN A components of the core gene expression 



machinery, but they are also involved in multiple aspects 
of genetic regulation. This latter feature has been widely 
recognized with the discovery of microRNAs, siRNAs, 
piRNAs, IncRNAs, etc. (2). The regulatory activities 
of the ncRNAs include roles in chromatin remodeling, 
transcription, splicing, translation, RNA stability and even 
the stability and translocation of proteins (1-5). These 
functions usually depend upon their primary sequence 
and secondary structure in order to mediate interactions 
with proteins and other nucleic acids. Therefore, accurate 
determination of the RNA primary sequence is important 
for subsequent functional studies. 

The rapid development in experimental and computa- 
tional methodologies has significantly increased our abil- 
ity to identify and study new ncRNAs. High-throughput se- 
quencing of the transcriptome (RNA-seq) has been widely 
used for its high sensitivity and nucleotide resolution, and 
revealed hundreds to thousands of short and long ncR- 
NAs in organisms from all three domains of life (6-8). De 
novo predictions based on evolutionary conservation and 
thermodynamic folding have also identified large numbers 
of ncRNAs and structured RNA elements in the genome 
(9,10). However, these methods do not provide enough 
resolution to accurately define the ends of the ncRNAs 
(11), and ends of the most ncRNAs are not well defined. 
Traditional methods of RNA end determination, such as 
5' RACE (Rapid Amplification of cDNA Ends) and 3' 
RACE (rapid amplification of cDNA ends), although accu- 
rate, are labor-intensive and suffer from very low through- 
put (12,13). More advanced high-throughput experimental 
methods have been developed recently to map RNA ends, 
e.g. (14,15), but many of these methods are complicated 
and/or require the presence of poly(A) tails. In addition, 
new ways of analyzing the vast amount of existing RNA- 
seq data will be cost-effective and useful for gaining insights 
into various aspects of RNA structure and processing. 

The traditional method for preparing cDNA libraries 
was developed by Gubler and Hoffman (16), which uses re- 
verse transcriptase for first strand cDNA synthesis, RNase 
H, Escherichia coli DNA polymerase I and DNA ligase 
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for second strand synthesis. This method is also com- 
monly used for RNA-seq library preparation. Within cer- 
tain RNA-seq datasets whose libraries were prepared us- 
ing the Gubler-Hoffman method, we have discovered that a 
large number of the 'unmappable' reads are chimeric. That 
is, these reads consist of two parts: one from the 5' or 3' 
end of the RNA, and the other from an internal region of 
the RNA, on the opposite strand. This phenomenon clearly 
suggests self-priming from the 3' end stem-loop, or ligation 
of the 5' end stem-loop during cDNA library preparation. 
Using the chimeric reads from existing datasets, we devel- 
oped a program, called Vicinal, to precisely determine the 
boundaries of ncRNAs and provide support for the pre- 
dicted terminal stem-loops. 

MATERIALS AND METHODS 

Total RNA-seq of fruit fly larvae, pupae and pharate adults 

Total RNA was extracted from third instar larvae, pupae 
and pharate adult flies and treated with DNase I to remove 
DNA contamination. Ribosomal RNAs were removed 
from the samples using the Ribo-Zero Human/Mouse/Rat 
kit (Epicentre). A TruSeq RNA Sample Preparation Kit v2 
(Illumina) was used for barcoding, multiplexing and cDNA 
library preparation. The TruSeq procedure first fragments 
the rRNA-depleted samples and performs first strand syn- 
thesis using reverse transcriptase and random primers. The 
second strand synthesis uses DNA polymerase I and RNase 
H. The cDNA fragments then go through an end repair 
by adding a single adenosine at the ends. Adapters are 
ligated after repair. Paired end (2 x 48) sequencing was 
performed on an Illumina HiSeq 2000 platform. The data 
were deposited in the Gene Expression Omnibus with ac- 
cession number GSE50711, and named FlyJarva3_48nt, 
Fly_pupa_48nt and Fly_pharate_48nt. 

Additional RNA-seq datasets 

Additional RNA-seq data used in this study were gen- 
erated by our lab and others. They are listed as follows, 
together with their Short Read Archive accession num- 
bers. Here we also briefly describe library preparation 
methods used to obtain these data, to help under- 
stand the Vicinal methodology. Fly_ovary_RIP_35nt: 
SRR120120-SRR120139 and SRR287104-SRR287107 (24 
datasets) (17). Fly_S2_45nt: SRR345574-SRR345591 (18 
datasets) (18). Mouse_ES_40nt: SRR392624-SRR392626 
(3 datasets) (18). Mouse_ES_51nt: SRR915881-SRR915888 
and SRR941123-SRR941140 (26 datasets) (19). 
Mouse_satellite_50nt: SRR953246 (1 dataset) (20). Hu- 
man_HCT116_50nt: SRR901290-SRR901292 (3 datasets) 
(21). 

The fly_ovary_RIP_35nt libraries were described previ- 
ously (17). Briefly, Sm protein containing ribonucleoprotein 
(RNP) complexes from Drosophila ovaries were immuno- 
precipitated using anti-Sm or anti-GFP (green fluorescent 
protein) antibodies and the associated RNAs were purified. 
No polyA selection or rRNA removal was performed 
on the immunopurified RNA. First strand synthesis was 
carried out using a Superscript III kit (Life Technologies). 
Second strand synthesis was performed using E. coli DNA 



polymerase I and RNase H (Life Technologies). Double- 
stranded cDNAs were made into libraries and sequenced 
using Illumina Genome Analyzer II. The fly_larva3_48nt, 
fly_pupa_48nt RNA-seq datasets were generated the same 
way as the fly_pharate_48nt described above. The Shi- 
latifard lab generated the fly_S2_45nt, mouse _ES_40nt, 
mouse_ES_51nt and human_HCT_116_50nt RNA-seq 
datasets (18,19,21). Ribosomal RNA was removed from 
two micrograms of DNase -treated total RNA using the 
Ribo-Zero kit from Epicentre, and libraries were made 
using the Tru-seq mRNA kit from Illumina. The generation 
of mouse_satellite_50nt dataset by the Sartorelli lab also 
followed a similar protocol (20). 

Bioinformatic pipeline 

To obtain a rough estimate of the sequencing coverage and 
length of ncRNAs, the RNA-seq reads were mapped to 
reference genomes using Bowtie, allowing a maximum of 
two mismatches (end-to-end mapping) (22,23). Splicing was 
not considered in read mapping as only intronless ncR- 
NAs were investigated in this study. Then the same raw 
reads were also mapped to genome references using Bowtie2 
in the sensitive -local mode, which allows softclipping. The 
Bowtie- and Bowtie2 -mapped reads were used to make bed- 
graph files for visualization in a genome browser. These bed- 
graph tracks were only shown in the analysis of snRNA:Ul. 

In order to identify self-priming and ligation events, the 
Bowtie2 mappable reads were filtered (using samsoftfil- 
ter.py, see the software package and instructions therein) to 
select reads that are only partially mapped to the genome, 
leaving at least n nucleotides from either end that are not 
mappable (n > 5). After filtering, the unmappable parts 
of the partially mappable reads were mapped again to 
the vicinity of the mappable parts (using VicinaLl.O.py 
and Vicinal_2.0.py). This step generates a SAM (sequence 
alignment /map) file of chimeric reads and two wiggle files 
containing the coverage data for both the plus and mi- 
nus strands. To make the method easy to use, we prepared 
a complete set of command line instructions. The scripts 
and instructions are available for download from the fol- 
lowing website: https://sites.google.com/site/zhipeng0426/ 
programming. 

Implementation of the Vicinal algorithm 

Samsoftfilter.py parses the CIGAR information from the 
input SAM file to find reads that have at least one soft- 
clipped region longer than a defined value (default n > 
5). Shorter softclipped regions (« < 5) could be sequenc- 
ing errors or other kinds of chimeric sequences and there- 
fore not considered in subsequent analysis. The softclipped 
reads were processed using VicinaLl.O and Vicinal JZ.O.py 
scripts. The purpose of the VicinaLl.O and Vicinal_2.0.py 
script is to map the unmappable regions of the partially 
mapped reads to the vicinity of the mappable parts, on 
the opposite strand. We term this process 'vicinal mapping' 
to distinguish it from the Bowtie2 terminology of 'local 
mapping'. VicinaLl.O.py stores an initialized dictionary for 
fast processing but is only appropriate for genomes with 
smaller chromosomes, e.g. fly and nematode, whereas Vic- 
inaE2.0.py does not store an initialized dictionary, and is 
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slower, but can be used for genomes with larger chromo- 
somes such as mouse and human. In order to map the soft- 
clipped reads efficiently and minimize memory footprint, 
the reads were sorted by chromosomal position. Once the 
reads mapped to one chromosome are processed, the results 
are written to output files (file_prefix_chim.sam for chimeric 
reads, file_prefix_l .wig and file _prefix_2. wig for coverage). 
Prior to vicinal mapping, the CIGAR code from the soft- 
clipped SAM file is parsed, and accordingly each read is di- 
vided into two or three parts, S+M, M+S or S+M+S, de- 
pending on whether there are softclipped fragments on the 
5' and/or 3' end, where S represents 'softclipped' and M rep- 
resents 'matched.' Internal mismatches were ignored. A re- 
gion around each mapped fragment (with a radius defined 
by the users, default is 100 nt) was extracted from the refer- 
ence genome on the opposite strand. Then the softclipped 
fragments were searched against the extracted region and a 
total of one match is allowed. Once a match is found, the 
record for that read is output to a SAM file, and the coor- 
dinates of the mapped fragments on both strands were cal- 
culated and output to the two wiggle files. The wiggle files 
can be further converted to smaller gzipped bedgragh files 
for efficient storage and transfer. See detailed instructions in 
the Vicinal software package: https://sites.google.com/site/ 
zhipeng0426/programming. 

ncRNA lists and generation of lists with chimeric read num- 
bers 

Lists of fruit fly, mouse and human ncRNA co- 
ordinates used in the analysis were generated as 
follows. The Drosophila ncRNA list was down- 
loaded from UCSC Genome Bioinformatics site 
(http://hgdownload.soe.ucsc.edu/goldenPath/dm3/ 
database/), matched with gene names from Flybase 
(http://flybase.org/static_pages/downloads/COORD.html) 
and rearranged according to the format described in the 
Vicinal software. The mouse ncRNA list was downloaded 
from the mouse genome informatics (MGI) database at 
Jackson Laboratory (ftp://ftp.informatics.jax.org/pub/ 
reports/MGLMRK.Coord.rpt). The human ncRNA list 
was downloaded from Ensembl (ftp://ftp.ensembl.org/ 
pub/release- 74/fasta/homo_sapiens/ncrna/) and rearranged 
accordingly. The number of chimeric reads for each 
ncRNA can be obtained using the readnum.sh script, 
which depends on the samtools package (24). 

Northern blotting 

S2 cells were homogenized in TRIzol (Life Technologies) 
and total RNA was extracted following manufacturer's 
instructions. RNA was electrophoresed in 4-12% TBE- 
Urea polyacrylamide gels (Life Technologies), transferred 
to nylon membranes and probed with 32 P-labeled poly- 
merase chain reaction (PCR) products corresponding to the 
Drosophila melanogaster U2 and LU snRNA cDNAs. 

RNA secondary structure prediction 

The secondary structures of non-coding RNAs were pre- 
dicted using either UNAfold or the Vienna RNA Pack- 



age with default parameter settings (25,26). Alternative sec- 
ondary structures (conformers) were occasionally adjusted 
manually to fit the chimeric reads. Structured alignments 
of ncRNAs were performed using LocARNA (global stan- 
dard alignment ) (27). Secondary structures of the predicted 
RNAs were drawn using VARNA (28). 

RESULTS 

Previously, we carried out an RNA-immunoprecipitation 
sequencing (RIP-seq) analysis to identify RNAs that co- 
purify with Sm proteins in Drosophila and human cells 
(17). During preparation of the sequencing libraries, we 
used either oligo-dT or random hexamer primers for first 
strand cDNA synthesis from RNA. Curiously, we found 
that both random and oligo-dT primed libraries contained 
large numbers of snRNA transcripts. This latter result was 
unexpected because snRNAs are not polyadenylated. To ex- 
plain this observation, we considered the possibility that 
the snRNA reads detected in the oligo-dT primed libraries 
might be oligoadenylated RNA degradation intermediates 
(29). However, manual inspection of reads derived from 
snRNA 3' ends did not reveal oligo(A) extensions in ei- 
ther the oligo-dT or random hexamer primed libraries; in- 
stead, the (non-templated) extensions appear to be products 
of self-priming from stem-loop sequences that are typically 
present at the 3' ends of snRNAs. Though much less fre- 
quent, we also found reads that contain 5' extensions, which 
might be the result of ligation events between cDNA and 
RNA 5' stem-loops. A diagram of possible mechanisms for 
the generation of 5' end and 3' end chimeric reads is pre- 
sented in Figure 1. The 3' end stem-loop can serve as a 
primer for first strand cDNA synthesis. cDNA fragments 
that are close to the 5' end stem-loop can be ligated with the 
5' end RNA by DNA ligase. The two types of DNA-RNA 
chimera could, in principle, serve as templates for second 
strand synthesis. The resultant double-stranded DNA could 
then be further ligated with adapters and sequenced. The 5' 
cap structures present at many ncRNAs provide one expla- 
nation for the low efficiency of the 5' ligation. However, it is 
not entirely clear how DNA polymerase I uses DNA-RNA 
chimeras as templates for second strand synthesis. The dia- 
gram in Figure 1 is provided for illustrative purposes to help 
understand how chimeric reads might be generated. 

The Vicinal algorithm 

Initial examination of the chimeric reads derived from snR- 
NAs reveals two important features, irrespective of whether 
they arose via self-priming or ligation. First, the two parts 
of each chimeric read usually map close to one another, 
within 100 nt. This distance is basically determined by the 
size of the terminal stem-loop. Second, the two parts of each 
chimera map to opposite strands of the encoding DNA, un- 
like reads derived from spliced RNAs, which map to the 
same strand. Based on these properties, we developed an 
analysis pipeline to identify reads that are derived from self- 
priming and ligation events (Figure 1). 

For vicinal mapping to work, the RNA-seq libraries must 
be prepared in a way that allows for self-priming. This is 
usually accomplished by cDNA synthesis prior to adapter 
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Figure 1. Vicinal pipeline and possible mechanisms for the generation of chimeric reads. (A) Flowchart of the analysis pipeline. The RNA-seq libraries are 
prepared such that they do not exclude ncRNAs, and the reverse transcription step precedes adapter ligation. The RNA-seq reads are first aligned to the 
genome using Bowtie2 in the local-mapping mode (-sensitive-local). Then partially mapped reads are selected and vicinally mapped using Vicinal. Bowtie2 
mapped reads are directly used to roughly estimate the boundaries of the ncRNAs, while the Vicinal mapped reads are used to determine the boundaries. 
Finally, the secondary structure is predicted and adjusted to fit the chimeric reads. The convention of colors is consistent throughout all the figures. (B) 
Many ncRNAs in solution adopt secondary structures with terminal stem-loops. During reverse transcription, the 3' end stem-loop can serve as a primer, 
in addition to primers added to the solution, for cDNA synthesis (red and blue lines with arrows are cDNA fragments). After cDNA synthesis, the cDNA 
fragments close to the 5' end stem-loop can be ligated to the stem-loop. The 5' end and 3' end cDNA chimeras can further serve as templates for second 
strand DNA synthesis, thus producing cDNA fragments for subsequent adapter ligation and deep sequencing. 



ligation, on RNA samples that contain ncRNAs (Figure 
1A). Because chimeric reads represent only a small por- 
tion of the total number of reads (see read mapping statis- 
tics in Table 1, and Figure 2 A and B for an example), ef- 
ficient processing of raw RNA-seq data is important for 
subsequent analysis. We used Bowtie2 for preliminary map- 
ping, because it is fast and allows softclipping of the reads 
for local (partial) mapping (23). The Bowtie2 mapping re- 
sults provide a rough estimate of the coverage and size of 
transcripts. For the locally (partially) mapped reads, only 
the longer segment of the read is mapped to the genome, 
whereas the shorter segment is softclipped/ignored. 

After the initial mapping step, we filtered the mapped 
reads to select those with at least one softclipped segment 
longer than a defined size (e.g. 5 nt). The size of the soft- 
clipped segment is chosen so that the fragment can be 
uniquely mapped in the vicinity of the Bowtie2 mapped part 
of the read. The softclipped segments are then mapped 'vic- 
inally', that is, mapped to a region within a certain distance 
(e.g. 100 nt) from the mapped segment, on the opposite 
strand (Figure 1A). 

Once both segments are mapped, the junction is used to 
define the ends of the ncRNA, and terminal stem-loops in 
the predicted secondary structures are used to explain the 
source of the chimeric reads (Figure 1A and B). However, 
the presence of self-primed and ligated reads does not im- 
ply that the chimera-generating terminal stem-loops are sta- 
ble in vivo. It is only evidence for the presence of the termi- 
nal stem-loops in solution, likely in equilibrium with other 
conformations (see Figures 2H and 3E, H, J and L). We 



generated lists of ncRNA genomic coordinates and used 
them to intersect with chimeric reads generated by Vicinal 
to make lists of ncRNAs with numbers of chimeric reads 
(see instructions and the results: https://sites.google.com/ 
site/zhipeng0426/programming ) . 

We have analyzed hundreds of RNA-seq datasets us- 
ing Vicinal and found 1 1 5 of them containing self-primed 
and ligated chimeric reads. These datasets were sorted 
into nine different categories, according to organism, tis- 
sue and/or read length (Table 1). The data were gener- 
ated by several different laboratories, demonstrating that 
such chimeric reads are not specific artifacts of a single 
lab Five of the nine categories are sourced from the fruit- 
fly, three from the mouse and one from human. Statis- 
tics of the Bowtie2 local mapping, filtering and Vicinal 
mapping are presented. Although the fraction of chimeric 
reads is not high in any of the datasets, there are enough 
of them to determine the ends of many ncRNAs. Given 
the large number of starting raw reads in most RNA-seq 
experiments, Vicinal analysis provides users with numer- 
ous ncRNAs with sufficient chimeric read coverage. These 
include snRNAs, snoRNAs, scaRNAs, 5.8S rRNA, 7SK 
RNA, 7SL RNA, RNaseP RNA, RNaseMRP RNA, etc. 
(see Figures 2 and 3, Supplementary Figures SI and S2 
and ncRNA lists from our website: https://sites.google.com/ 
site/zhipeng0426/programming). The chimeric read cover- 
age for these ncRNAs varies greatly from dozens to thou- 
sands of reads per RNA. However, certain types of ncR- 
NAs, e.g. IncRNAs, miRNAs, siRNAs, etc., do not typi- 
cally have chimeric reads and thus the Vicinal program is 
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Figure 2. Vicinal analysis of known fly ncRNAs, snRNA:Ul (A-E) and RNaseP:RNA (F-L). (A) Five genome browser tracks for snRNA:Ul:95Cc are 
shown from the analysis of fly_ovary_RIP_35nt datasets: btl_E2E (Bowtiel end-to-end mapping of raw reads), bt2_local (Bowtie2 local mapping), soft 
(selecting softclipped reads from bt2_local), plus and minus (reads mappable by Vicinal). The thick blue bar represents the mature Ul:95Cc transcript 
region. The vertical dashed blue lines align the 5' and 3' ends of the Ul transcript. Note that there are five functional Ul snRNA genes in the fly genome 
and only one is shown here. The high peaks mapped to the middle of Ul are artifacts from mapping short softclipped parts of reads. (B) Filtering reads 
from fly .ovary _35nt_plus/minus for reads close to the estimated Ul snRNA ends showed clear terminal pileup of reads. Reads for all five Ul snRNA 
paralogs were combined. Note the difference in scale for the 5' end and 3' end chimeric reads. (C) Detailed analysis of the chimeric reads that map to both 
ends. The first line is the genomic DNA sequences around the 5' and 3' ends. Subsequent lines are the manually aligned chimeric reads, where black letters 
represent parts mapped to the ends of the transcript, blue letters represent 3' extensions mapped to the internal region on the opposite strand, and the 
red letters represent 5' extensions mapped to the internal region on the opposite strand. The numbers before each chimeric read sequence are read counts. 
Note the differences between the extended genomic DNA and the terminal extensions in the chimeric reads. Only the top 10 groups of distinct reads are 
shown for the 3' end, and top 6 groups of distinct reads for the 5' end. (D) Predicted secondary structure that explains the production of the chimeric reads. 
The black lines represent parts of reads mapped to the ends of Ul snRNA, whereas the red and blue lines represent terminal extensions mapped to the 
internal regions of Ul snRNA. (E) Potential equilibrium in solution between the chimera-generating secondary structure (on the left, the same as in D) 
and the well-known physiological secondary structure in Ul snRNP (on the right). The normal secondary structure is unlikely to give rise to 5' end ligated 
reads due to the long 5' overhang. (F) Ten genome browser tracks for RNaseP:RNA are shown from the Vicinal analysis of five groups of fly RNA-seq 
data. For simplicity, the end-to-end mapping, local mapping and softclipped read tracks are not shown. Note the terminally adjusted read pileups; they are 
not filtered as in B for Ul snRNA. The 5' end of the RNaseP:RNA is on the right. Chimeric reads were combined from all five groups of RNA-seq data, 
for subsequent detailed analysis. (G) Detailed analysis of the chimeric reads that map to 3' end of RNaseP:RNA. (H) The chimera-generating secondary 
structure of the 3' end of RNaseP:RNA. (I and K) Detailed analysis of reads mapped the 5' end reveals two possible 5' ends that differ by four nucleotides. 
(J and L) The chimera-generating secondary structures of the 5' end of RNaseP:RNA. Note, the secondary structures shown here for the 5' end and 3' 
end are different from the physiological secondary structure of the RNaseP RNP. 



not applicable. These results demonstrate the utility of our 
approach in the analysis of multiple categories of ncRNAs, 
with a wide range of expression levels, in different organ- 
isms. Here we show several examples of using Vicinal to 
analyze several known and newly discovered fly ncRNAs. 



More examples of Vicinal analysis on fly, mouse and hu- 
man ncRNAs are presented in Supplementary Figures SI 
and S2. 
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Table 1. RNA-seq datasets used in the study 

Read 



Sample 


Reference 


length 


Mappable 


Softclipped 


Chimeric 


% Chimeric 


fly_ovary_RIP 


Lu et al. (2014) (17) 


35 


87594638 


11334332 


990969 


1.13 


fly_pharate 


This study 


48 


124544603 


13919998 


226981 


0.18 


fly_S2 


Smith etal. (2011) (18) 


45/50 


224898608 


18788217 


387406 


0.17 


fly_larva3 


This study 


48 


241674561 


25221457 


1294206 


0.53 


fly_pupa 


This study 


48 


129388545 


12737427 


182526 


0.14 


mouse_ES 


Smith et al. (2011) (18) 


40 


101615022 


5573492 


93137 


0.09 


mouse_ES 


Hue/of (2013) (19) 


51 


712409456 


53374271 


438910 


0.06 


mouse_satellite 


Mousavi et al (2013) (20) 


50 


41019420 


1910902 


57034 


0.14 


human_HCT116 


Hue/ al. (2013) (21) 


50 


90190285 


6695509 


123856 


0.14 



Confirmation of known snRNA:Ul and RNaseP:RNA ends 
using Vicinal 

Most snRNAs have stem-loops at their 5' and 3' ends with 
very short overhangs, and their sequences and secondary 
structures are well characterized. Sm protein immunopre- 
cipitations enrich for snRNAs, and among them, Ul is the 
most abundant. Therefore, we first analyzed chimeric reads 
derived from Ul snRNA as a proof-of-principle (Figure 
2A-E). The chimeric reads for Ul are most abundant in the 
fly_ovary_RIP_35nt sample and therefore we only showed 
this group of RNA-seq data in the Vicinal analysis of Ul. 

The read coverage patterns for Ul snRNA, using Bowtie 
(end-to-end mapping) and Bowtie2 (local mapping), are 
not uniform (Figure 2A, and for other ncRNAs, data not 
shown). This is especially true near the two ends, because 
untemplated extensions in the reads are not mappable, 
and the priming and sequencing efficiency along Ul varies 
according to sequence and structural contexts. The non- 
uniformity of the read coverage makes estimation of tran- 
script size difficult. However, selection of the Bowtie2 lo- 
cally (partially) mapped reads clearly shows that many of 
the mappable fragments are justified to the left or right ends, 
suggesting the existence of softclipping in Bowtie2 mapping 
(Figure 2A and B). Vicinal mapping of the chimeric reads 
places the unmapped segments on the opposite strands 
(Figure 2A and B) and the terminally mapped half-reads in- 
dicate the presence of terminal stem-loops (Figure 2C). Pat- 
terns of read coverage showed clear end justification (Figure 
2A and B, the dashed blue line, see also examples in subse- 
quent figures). 

Detailed alignment of the partially mapped Ul reads 
showed clear signs of chimera formation (Figure 2C). The 
presence of terminal overhangs and imperfect complemen- 
tarity in the stem allows for definition of the boundaries. 
Importantly, the abundance of the 3' end-derived chimeric 
reads confirmed the stable 3' end stem-loop which allows 
for efficient self-priming, despite the presence of imperfect 
complementarity. In fact, the presence of base pair mis- 
matches made it possible to define the ends with near single- 
nucleotide resolution. The identity of the few additional nu- 
cleotides (usually 1-2 nucleotides) close to the end of the 
mature transcript may interfere with the accuracy of end 
determination, but most of the time, they are short enough 
to allow near-nucleotide determination. In contrast to the 
3' end reads, there were many fewer reads derived from the 
Ul 5' end. The relative dearth of 5' end chimeric reads is 
likely due to the fact that the 5' overhang in the predicted 



secondary structure is quite long (~11 nt) and that liga- 
tion to first strand cDNA is likely to be very inefficient, 
due to the presence of the trimethylguanosine (TMG) cap. 
Because cDNA library construction takes place on purified 
RNAs and not on stable RNPs, Ul may well adopt alterna- 
tive secondary structures in solution. One such alternative 
Ul structural isomer (see Figure 2E) has no overhang and 
might be a better substrate for generation of chimeric 5' end 
reads. Irrespective of the mechanism, the structure of the 
observed Ul chimeric reads is consistent with the known 5' 
and 3' ends of Ul snRNA (30). 

RNaseP:RNA is a ribozyme that cleaves pre-tRNA 5' 
end leader sequences during tRNA biogenesis and is an es- 
sential RNA in all life forms. Here we present a detailed 
analysis of RNaseP:RNA ends using Vicinal (Figure 2F- 
L). Chimeric reads are detectable for RNaseP:RNA in all 
five categories of fly RNA-seq data, with varied abundance 
(Figure 2F). We combined chimeric reads from all five of 
these groups, and Vicinal analysis revealed clear terminally 
justified chimeric reads for both 5' and 3' ends. The 3' end 
chimeric reads clearly define a single end, with a maximum 
of two ambiguous nucleotides (Figure 2G and H), consis- 
tent with previous report (31). The 5' ligated reads suggest 
two possible ends, differing by 4 nt (Figure 2I-L). One of 
the two 5' ends is consistent with previous reports (Figure 
21 and J) (31). It is likely that the other one (Figure 2K and 
L) represents a transcript that uses a different transcription 
start site or is subject to alternative 5' processing. The termi- 
nal stem-loops that explain generation of 5' and 3' chimeric 
reads are different from the physiological secondary struc- 
ture (present in the RNaseP RNP particle), but nonetheless 
they are very likely to exist in solution (31). Taken together, 
our method defines the boundaries of two known ncRNAs. 

Vicinal analysis defines boundaries of newly discovered snR- 
NAs and sno/scaRNAs 

In order to show the utility of Vicinal in analysis of novel or 
under-studied ncRNAs, we have examined all Drosophila, 
mouse and human ncRNAs using Vicinal and detected 
chimeric reads in many of them. Here we show two ncR- 
NAs that we discovered in our previous RIP-seq analysis 
(17). Dozens of additional examples are presented in Sup- 
plementary Figures SI and S2. 

(i) snRNA:LU. Like-U is a newly evolved Sm-class 
snRNA (CR43708), present only in Drosophilid 
genomes (17). Vicinal analysis of LU snRNA revealed 
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Figure 3. Vicinal analysis of snRNA:Like-U (LU) (A-D) and scaRNA:Prp8 (E-G). Please refer to Figure 2 for general description of the analysis flow. 

(A) Only two genome browser tracks for snRNA:LU are displayed: plus and minus, whereas the end-to-end mapping, local-mapping and softclipped read 
tracks are not shown. Note the size predicted by Jung et al. (32) (transcript model, the thick blue line) is longer than the size determined by Vicinal analysis. 

(B) Detailed analysis of the chimeric reads. Only the top 10 groups of distinct reads are shown. The 3' ligated reads included variants, because sequencing 
the long stretch of adenosines unavoidably introduces errors. (C) The chimera-generating secondary structure for snRNA:LU. The Sm site is shown on the 
predicted secondary structure. (D) Northern blot of Drosophila U2 and LU snRNAs. (E) Six genome browser tracks for scaRNA:Prp8 are shown from the 
analysis of three groups of RNA-seq datasets, where chimeric reads for this RNA are available. Note that the earliest annotation labeled this ncRNA as 
a snoRNA (as shown in the gene annotation track); however, subsequent research suggests that it is a scaRNA. (F) Detailed analysis of the chimeric reads. 
(G) The chimera-generating secondary structure for scaRNA:Prp8. (H) Potential equilibrium between the more likely physiological secondary structure 
(on the left, with the pseudouridylation pocket open in the last stem-loop in a black box) and the chimera-generating secondary structure (on the right). 



hundreds of terminally justified fragments and inter- 
nally mapped second fragments at both the 5' end 
and 3' end of the transcript (Figure 3A). We analyzed 
these chimeric reads and predicted that the length 
of LU snRNA is 116 nt. The predicted secondary 
structure is shown in Figure 3C. Previously, Jung et al. 



(2010) analyzed publicly available RNA-seq data and 
identified a transcript from this locus, estimating its 
length to be 150-160 nt (32). To resolve this difference 
in length prediction, we performed northern blotting 
of LU snRNA, which showed a size that is consistent 
with our Vicinal analysis (1 10-120 nt, Figure 3D). This 
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size is also consistent with the sequence conservation 
of LU orthologs among Drosophilids (17). 
(ii) scaRNA:Prp8. Another novel Sm-associated ncRNA 
we discovered in our RIP-seq study is scaRNA:Prp8 
(CR43600; Figure 4C). A previous transcriptomic 
study estimated its size to be 178 nt (8). Here, Vici- 
nal analysis of three groups of RNA-seq data revealed 
dozens of terminally justified fragments and predicts a 
length of 168 nt (Figure 3E-H). This size is also con- 
sistent with the alignment of scaRNA orthologs in in- 
sects. The chimeric reads can be explained by an alter- 
native conformation of the secondary structure (Figure 
3G and H, right side). The other conformation (Figure 
3H, left side) is likely to be the physiological one, due to 
the presence of an open pseudouridylation pocket (Fig- 
ure 3H, black box), as reported recently by Deryusheva 
and Gall (33). 

The determination of ncRNA ends following Vicinal 
mapping requires manual alignment of the chimeric reads 
and fitting onto predicted secondary structures. In order 
to make best use of the large amounts of chimeric reads 
mapped using Vicinal from Drosophila, mouse and human 
RNA-seq datasets, we provide them as lists for users to 
identify ncRNAs of their interest. The lists contain ncRNA 
identifiers and numbers of chimeric reads mapped to each 
ncRNA. Chimeric read coverage patterns can be visualized 
by importing the bedgraph track files into genome browsers. 
The chimeric reads for each ncRNA can be extracted from 
the Vicinal mapped BAM (binary alignment/map) files 
and manually aligned. In the future, additional RNA-seq 
datasets can be added to increase the chimeric read cover- 
age on ncRNAs in the species analyzed, and potentially in 
other species as well. 

DISCUSSION 

In this study, we present a new bioinformatic tool, called 
Vicinal, to define the ends of ncRNAs with terminal stem- 
loops. This method takes advantage of the self-priming and 
ligation property of ncRNA 3' and 5' terminal stem-loops 
during library preparation using the Gubler-Hoffman 
method (16), and the power of massively parallel sequenc- 
ing. Using Vicinal, we confirmed the boundaries of previ- 
ously studied ncRNAs and also defined the boundaries of 
newly discovered ncRNAs from many different RNA-seq 
datasets from various species. 

Although other methods are available for the determi- 
nation of ncRNA ends, many of them are labor-intensive 
and require more experiments. Our analysis method makes 
use of published RNA-seq data and is cost-effective. More 
accurate determination of ends for more ncRNAs will be 
available with the publication of ever increasing amount 
RNA-seq data. 

It has long been known that 3' end self-priming of U3 
snoRNA mediates pseudogene formation during the pro- 
cess of retrotransposition (34). Pseudogenes derived from 
other highly structured ncRNAs, including Ul and U2 
snRNAs, are also known to form in this manner. Fur- 
thermore, self-priming from 3' end stem-loops is a rela- 
tively common feature among certain single-stranded RNA 



and DNA viruses (35-37). This self-priming ability is re- 
quired for proper replication of the viral genome. More- 
over, the high efficiency and specificity of self-priming from 
terminal stem-loops has been exploited for quantification 
of small RNA levels by reverse transcriptase-polymerase 
chain reaction (RT-PCR), wherein a stem-loop RT primer 
is used instead of a conventional, unstructured primer (38). 
In contrast to the widespread use of 3' self-priming in na- 
ture, we are not aware of previous findings regarding the 
phenomenon of ligation to 5' end RNA stem-loops dur- 
ing cDNA library construction, and further studies will be 
needed in order to understand the mechanism. 

The use of soft-clipped reads for mapping inevitably cre- 
ates artifacts. Vicinal mapping sometimes assigns reads to 
exon-exon junctions, due to their short length after clip- 
ping (data not shown). Other kinds of artifacts are also ob- 
served, mainly in highly expressed ncRNAs, such as riboso- 
mal RNAs and certain snRNAs (Figure 2A). However, such 
artifacts can be clearly distinguished from chimeras that are 
generated by self-priming and ligation. The latter have dis- 
tinct features, such as terminally justified pileups. We note 
that certain RN A secondary structures are likely to be more 
favorable for chimera formation than others. Although such 
structures are not necessarily the most stable ones in solu- 
tion or in vivo (see Figures 2D, H, J, L and 3G), the bound- 
ary mapping procedure described here can easily pick up 
such low-frequency priming events. 

In summary, the method described above enables highly 
sensitive analysis of ncRNA boundaries. The use of fast 
short-read mappers (we used Bowtie2) in combination with 
rapid local alignment of what would otherwise be consid- 
ered 'unmappable' fragments allows for efficient processing 
of large datasets in a relatively short period of time. Be- 
cause terminal stem-loops and internal single-stranded re- 
gions are common features of many ncRNAs, our method 
should prove useful for a wide variety of studies in RNA 
biology. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 
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